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Abstract 

Neural network algorithms have been recently applied to construct Parton 
Distribution Function (PDF) parametrizations which provide an alternative 
to standard global fitting procedures. We propose a technique based on an 
interactive neural network algorithm using Self- Organizing Maps (SOMs). 
SOMs are a class of clustering algorithms based on competitive learning 
among spatially-ordered neurons. Our SOMs are trained on selections of 
stochastically generated PDF samples. The selection criterion for every opti- 
mization iteration is based on the features of the clustered PDFs. Our main 
goal is to provide a fitting procedure that, at variance with the standard 
neural network approaches, allows for an increased control of the systematic 
bias by enabling user interaction in the various stages of the process. 
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1 Introduction 

Modelling experimental data always introduces bias, in the form of either a 
theoretical or systematical bias. The former is introduced by researchers with 
the precise structure of the model they use, which invariably constrains the 
form of the solutions. The latter form of bias is introduced by algorithms, 
such as optimization algorithms, which may favour some results in ways 
which are not justified by their objective functions, but rather depend on the 
internal operation of the algorithm. 

In this paper we concentrate on high energy hadronic interactions, which 
are believed to be described by Quantum Chromodynamics (QCD). Because 
of the properties of factorization and asymptotic freedom of the theory, the 
cross sections for a wide number of hadronic reactions can be computed us- 
ing perturbation theory, as convolutions of perturbatively calculable hard 
scattering coefficients, with non perturbative Parton Distribution Functions 
(PDFs) that parametrize the large distance hadronic structure. The extrac- 
tion of the PDFs from experiment is inherently affected by a bias, which 
ultimately dictates the accuracy with which the theoretical predictions can 
be compared to the high precision measurements of experimental observables. 
In particular, the form of bias introduced by PDFs will necessarily impact 
the upcoming searches of physics beyond the Standard Model at the Large 
Hadron Collider (LHC). This situation has in fact motivated an impressive 
body of work, and continuous, ongoing efforts to both estimate and control 
PDFs uncertainties. 

Currently, the estabhshed method to obtain the PDFs is the global analysis, 

a fitting procedure, where initial scale Qo ~ IGeV < Q^li ansatzc, as a 
function of the momentum fraction x, for each parton flavour i in hadron h are 
evolved to higher scales according to the perturbative QCD rcnormalization 
group equations. All the available observables e.g. the proton structure 
function, ^2(0;, Q^), are composed of the candidate PDFs and comparison 
with the data is made with the help of some statistical estimator such as the 
global x^ 
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where the error matrix Vij consists of the statistical and uncorrelated system- 
atic errors, as well as of the correlated systematic errors when available. The 
parameters in the ansatze are then adjusted and the whole process repeated 
until a global minimum has been found. 

The modern PDF collaborations (CTEQ [1] and references within , MRST [2- 
4], Alekhin [5, 6], Zeus [7] and HI [8]) also provide error estimates for the PDF 
sets. They all rely on some kind of variant of the Hessian method (see e.g. 
[9] for details) , which is based on a Taylor expansion of the global around 
it's minimum. When only the leading terms are kept, the displacement of 
can be written in terms of Hessian matrix if^, which consists of second 
derivatives of with respect to the parameter displacements, evaluated at 
the minimum. The error estimate for the parameters themselves, or for any 
quantity that depends on those parameters, can then be obtained in terms 
of the inverse of the Hessian matrix. 



For details of PDF uncertainty studies see e.g. Refs. [10, 11]. 
The global analysis combined with Hessian error estimation is a powerful 
method, allowing for both extrapolation outside the kinematical range of the 
data and extension to multivariable cases, such as nuclear PDFs (nPDFs) 
[12-15]. In principle, when more data become available, the method could 
also be applied to Generalized Parton Distributions (GPDs), for which only 
model-dependent [16] or semi model-dependent [17, 18] solutions presently 
exist. 

However, there are uncertainties related to the method itself, that are difficult 
to quantify, but may turn out to have a large effect. Choosing global as a 
statistical estimator may not be adequate since the minimum of the global fit 
may not correspond to a minima of the individual data sets, and as a result 
the definition of Ax^ may be ambiguous. Estimates for the current major 
global analyses are that Ax^ = 50 — 100 is needed to obtain a ~ 90% confi- 
dence interval [1,2]. In principle this problem could be avoided by using the 
Lagrange multiplier method (sec e.g. [19]), which docs not assume quadratic 
behaviour for the errors around the minimum, instead of the Hessian method. 





2 



but this is computationally more expensive solution. Introducing a functional 
form at the initial scale necessarily introduces a parametrization dependence 
bias and theoretical assumptions behind the fits, such clS 5, 5, C quark con- 
tent, details of the scale evolution (e.g. higher order perturbative corrections, 
large/small x resummation), higher twists etc. as well as the data selection 
and treatment, e.g. kinematical cuts, all reflect into the final result of the 
analysis. Also, there may be systematical bias introduced by the optimiza- 
tion algorithm. The differences between the current global PDF sets tend to 
be larger than the estimated uncertainties [20], and these differences again 
translate to the predictions for the LHC observables, such as Higgs [21] or 

and Z production cross sections [1]. 
A new, fresh approach to the PDF fitting has recently been proposed by 
NNPDF collaboration [22, 23] who have replaced a typical functional form 
ansatz with a more complex standard neural network (NN) solution and the 
Hessian method with Monte Carlo sampling of the data (see the references 
within [23] for the nonsinglet PDF fit and the details of the Monte Carlo 
sampling) . 




Figure 1: Schematic diagram of a feed- forward neural network, from [24]. 



Neural network can be described as a computing solution that consists of 
interconnected processing elements, neurons, that work together to produce 
an output function. In a typical feed- forward NN (see Fig. 1) the output is 
given by the neurons in the last layer, as a non-linear function of the output 
of all neurons in the previous layer, which in turn is a function of the output 
of all neurons in the previous layer, and so on, starting from the first layer, 
which receives the input. For a NN with L layers and ni neurons in each 
layer, the total number of the parameters is J2f=i {''^I'^i+i + '^z+i)- 
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In the beginning of the NNPDF fitting procedure a Monte Carlo sample 
of replicas of the experimental data is generated by jittering the central 
values of the data withing their errorbars using univariate Gaussian (or some 
other distribution if desired) random numbers for each independent error 
source. The number of the replicas is made so large that the Monte Carlo 
set of replicas models faithfully the probability distribution of the original 
data. For each rephca a Genetic Algorithm (GA) fit is performed by first 
setting the NN parameters for each parton fiavour to be fitted randomly, 
then making clones of the set of parameters, and mutating each of them 
randomly (multiple mutations). After scale evolution the comparison with 
the data is performed for all the clones, and the best clones are selected for 
a source of new clones, and the process repeated until the minimum for the 
has been found. Overfitting of the data is prevented by using only part of 
the data in the minimizing procedure, and using the other part to monitor 
the behaviour of the x^. When fitting PDFs one thus ends up with A^j-ep 
PDF sets, each initial scale parton distribution parametrized by a different 
NN. The quahty of the global fit is then given by the computed from the 
averages over the sample of trained neural networks. The mean value of the 
parton distribution at the starting scale for a given value of x is found by 
averaging over the replicas, and the uncertainty on this value is the variance 
of the values given by the replicas. 

The NNPDF method circumvents the problem of choosing a suitable Ax^, 
and it relies on GA which works on a population of solutions for each MC 
replica, thus having a lowered possibility of getting trapped in local minima. 
NN parametrizations are also highly complex, with large number of param- 
eters, and thus unbiased compared to the ansatze used in global fits. The 
estimated uncertainties for NNPDF fits are larger than those of global fits, 
possibly indicating that the global fit uncertainties may have been underesti- 
mated. It should, however, be pointed out that the MC sampling of the data 
is not not tied to use of NNs, and it thus remains undetermined whether the 
large uncertainties would persist if the MC sampling was used with a fixed 
functional form. The complexity of NN results may also induce problems, es- 
pecially when used in a purely automated fitting procedure. Since the effect 
of modifying individual NN parameters is unknown, the result may exhibit 
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strange or unwanted behaviour in the extrapolation region, or in between 
the data points if the data is sparse. In such a case, and in a case of incom- 
patible data, the overfitting method is also unsafe to use. Implementation of 
information not given directly by the data, such as nonperturbative models, 
lattice calculations or knowledge from prior work in general, is also difficult 
in this approach. 

A possible method of estimating the PDF uncertainties could also be pro- 
vided by Bayesian statistical analysis, as preliminarily studied in [25, 26] and 
explained in [27], in which the errors for the PDF parameters, or for an 
observable constructed from the PDFs, are first encapsulated in prior prob- 
abilities for an enlarged set of model parameters, and posterior distributions 
are obtained using computational tools such as Markov Chain Monte Carlo. 
Similar to NNPDF approach, this method allows for an inclusion of non- 
Gaussian systematic errors for the data. 

In this introductory paper we propose a new method which relies on the 
use of Self-Organizing Maps (SOMs), a subtype of neural network. The 
idea of our method is to create means for introducing "Researcher Insight" 
instead of "Theoretical bias". In other words, we want to give up fully 
automated fitting procedure and eventually develop an interactive fitting 
program which would allow us to "take the best of both worlds" , to combine 
the best features of both the standard functional form approach and the 
neural network approach. In this first step, we solely concentrate on single 
variable functions, free proton PDFs, but discuss the extension of the model 
to multivariable cases. In Section 2 we describe the general features of the 
SOMs, in Sections 3 and 4 we present two PDF fitting algorithms relying 
on the use of SOMs and finally in Section 5 we envision the possibihties the 
SOM method has to offer. 

2 Self-Organizing Maps 

The SOM is a visualization algorithm which attempts to represent all the 
available observations with optimal accuracy using a restricted set of mod- 
els. The SOM was developed by T. Kohonen in the early 1980's ([28], see 
also [29]) to model biological brain functions, but has since then developed 
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into a powerful computational approach on it's own right. Many fields of 
science, such as statistics, signal processing, control theory, financial analy- 
ses, experimental physics, chemistry and medicine, have adopted the SOM 
as a standard analytical tool. SOMs have been applied to texture discrimi- 
nation, classification/pattern recognition, motion detection, genetic activity 
mapping, drug discovery, cloud classification, and speech recognition, among 
others. Also, a new application area is organization of very large document 
collections. However, applications in particle physics have been scarce so 
far, and mostly directed to improving the algorithms for background event 
rejection [30-32]. 

SOM consists of nodes, map cells, which arc all assigned spatial coordinates, 
and the topology of the map is determined by a chosen distance metric Mmap- 
Each cell i contains a map vector Vi, that is isomorphic to the data samples 
used for training of the neural network. In the following we will concentrate 
on a 2-dimensional rectangular lattice for simplicity. A natural choice for the 
topology is then Li{x,y) = Yl'i=i ~ Vil^ which also has been proved [33] 
to be an ideal choice for high-dimensional data, such as PDFs in our case. 
The implementation of SOMs proceeds in three stages: 1) initiahzation of 
the SOM, 2) training of the SOM and 3) associating the data samples with 
a trained map, i.e. clustering. During the initialization the map vectors are 
chosen such that each cell is set to contain an arbitrarily selected sample of 
either the actual data to be clustered, or anything isomorphic to them (see 
Fig. 2 for an example). The actual training data samples, which may be 
e.g. subset or the whole set of the actual data, are then associated with map 
vectors by minimizing a similarity metric M^ata- We choose M^ata — -^i- The 
map vector each data sample becomes matched against, is then the most 
similar one to the data sample among all the other map vectors. It may 
happen that some map vectors do no not have any samples associated with 
them, and some may actually have many. 

During the training the map vectors are updated by averaging them with the 
data samples that fell into the cells within a given decreasing neighbourhood, 
see Fig. 3. This type of training which is based on rewarding the winning 
node to become more like data, is called competitive learning. The initial 
value of a map vector Vi at SOM cell i then changes during the course of 
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Figure 2: A 2D grid SOM which cells get randomly associated with the type of 
data samples we would like to study, such as nonsinglet PDFs or observables. At 
this stage each cell gets associated with only one curve, the map vector. 

training as 

V,(t + 1) = V,(t) (1 - w(t) Nj,,(t)) + Sj(t) w(t) Nj,,{t) (3) 

where now Vi{t + 1) is the contents of the SOM cell i after the data sample 
Sj has been presented on the map. The neighbourhood, the radius, within 
which the map vectors are updated is given by the function A^j centered 
on the winner cell j. Thus even the map vectors in those cells that didn't 
find a matching data sample are adjusted, rewarded, to become more like 
data. Typically Nj^i{t) — e~^""^p^^''^^ /'^^*\ where r{t) is a monotonously de- 
creasing radius sequence. In the beginning of the training the neighbourhood 
may contain the whole map and in the end it just consists of the cell itself. 
Moreover, the updating is also controlled by w{t), which is a monotonously 
decreasing weight sequence in the range [0, 1]. 

As the training proceeds the neighbourhood function eventually causes the 
data samples to be placed on a certain region of the map, where the neigh- 
bouring map vectors are becoming increasingly similar to each other, and 
the weight sequence w{t) furthermore finetunes their position. 
In the end on a properly trained SOM, cells that are topologically close to 
each other will have map vectors which are similar to each other. In the final 
phase the actual data is matched against the map vectors of the trained map, 
and thus get distributed on the map according to the feature that was used 
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Figure 3: (Colour online) Each data sample Si is associated with the one map 
vector Vi it is most similar to. As a reward for the match, the winning map vector, 
as well as its neighbouring map vectors, get averaged with the data associated 
with the winning cell. 

as Mdata- Clusters with similar data now emerge as a result of unsupervised 
learning. 

For example, a map containing RGB colour triplets would initially have 
colours randomly scattered around it, but during the course of training it 
would evolve into patches of colour which smoothly blend with each other, 
see Fig. 4. This local similarity property is the feature that makes SOM 
suitable for visualization purposes, thus facilitating user interaction with the 
data. Since each map vector now represent a class of similar objects, the SOM 
is an ideal tool to visualize high-dimensional data, by projecting it onto a 
low-dimensional map clustered according to some desired similar feature. 
SOMs, however, also have disadvantages. Each SOM, though identical in 
size and shape and containing same type of data, is different. The clusters 
may also split in such a way that similar type of data can be found in sev- 
eral different places on the map. We are not aware of any mathematical 
or computational means of detecting if and when the map is fully trained, 
and whether there occurs splitting or not, other than actually computing the 
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Figure 4: (Colour online) SOM containing RGB colour triplets getting trained. 
Adapted from Ref. [34] 

similarities between the neighbouring cells and studying them. 
In this work we use the so-called batch-version of the training, in which all 
the training data samples are matched against the map vectors before the 
training begins. The map vectors are then averaged with all the training 
samples within the neighbourhood radius simultaneously. The procedure is 
repeated iVstep (free parameter to choose) times such that in every training 
step the same set of training data samples is associated with the evolving 
map and in Eq.(3) t now counts training steps. When the map is trained, 
the actual data is finally matched against the map vectors. In our study 
our training data are always going to be the whole data we want to cluster, 
and the last training step is thus the clustering stage. The benefit of the 
batch training compared to the incremental training, described earlier, is 
that the training is independent of the order in which the training samples 
are introduced on the map. 

3 MIXPDF algorithm 

In this approach our aim is to both i) to be able to study the properties of 
the PDFs in a model independent way and yet ii) to be able to implement 
knowledge from the prior works on PDFs, and ultimately iii) to be able to 
guide the fitting procedure interactively with the help of SOM properties. 
At this stage it is important to distinguish between the experimental data 
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and the training data of the SOM. When we are referring to measured data 
used in the PDF fitting, such as F2 data, wc always call it experimental data. 
The SOM training data in this study is going to be a collection of candidate 
PDF sets, produced by us, or something composed of them. A PDF set in 
the following will always mean a set of 8 curves, one for each independent 
parton flavour / = {g,Uv,dy,u,d,s = s,c = c and 6 = 6 in this simplified 
introductory study), that arc properly normalized such that 



In order to proceed we have to decide how to create our candidate PDF 
sets, decide the details of the SOMs, details of the actual fitting algorithm, 
experimental data selection and details of the scale evolution. 
In this introductory paper our aim is not to provide a finalised SOMPDF set, 
but rather to explore the possibilities and restrictions of the method we are 
proposing. Therefore we refrain from using "all the possible experimental 
data" as used in global analyses, but concentrate on DIS structure function 
data from HI [35], BCDMS [36,37] and Zeus [38], which we use without ad- 
ditional kinematical cuts or normalization factors (except rejecting the data 
points below our initial scale). The parameters for the DGLAP scale evolu- 
tion were chosen to be those of CTEQ6 (CTEQ6L1 for lowest order (LO)) 
[39], the initial scale being Qo = 1.3 GeV. In next-to-leading order (NLO) 
case the evolution code was taken from [40] (QCDNUM17 beta release). 
We will start now with a simple pedagogical example, which we call MIXPDF 
algorithm, where we use some of the existing PDF sets as material for new 
candidate PDFs. At first, we will choose CTEQ6 [39], CTEQ5 [41], MRST02 
[2,42], Alekhin [5] and GRV98 [43] sets and construct new PDF sets from 
them such that at the initial scale each parton fiavour in the range x — 
[10~^, 1] is randomly selected from one of these five sets (we set the heavy 
flavours to be zero below their mass thresholds). The sumrules on this new 





and conserve baryon number and charge 




(5) 
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set are then imposed such that the original normalization of Uy and are 
preserved, but the rest of the flavours are scaled together so that Eq.(4) 
is fulfilled. In this study we accept the <few% normalization error which 
results from the fact that our x-range is not x = [0, 1]. From now on we 
call these type of PDF sets database PDFs. We randomly initialize a small 
5x5 map with these candidate database PDFs, such that each map vector 
Vi consists of the PDF set itself, and of the observables F^(x, Ql) derived 
from it. Next we train the map with A^step = 5 batch-training steps with 
training data that consists of 100 database PDFs plus 5 original "mother" 
PDF sets, which we will call init PDFs from now on. We choose the similarity 
criterion to be the similarity of observables i^Kx, Q^) with M^ata = Li. The 
similarity is tested at a number of x-values (equidistant in logarithmic scale 
up to a; ~ 0.2, and equidistant in linear scale above that) both at the initial 
scale and at all the evolved scales where experimental data exist. On every 
training, after the matching, all the observables (PDFs) of the map vectors 
get averaged with the observables (PDFs, flavor by flavor) matched within 
the neighbourhood according to Eq. (3). The resulting new averaged map 
vector PDFs are rescaled again (such that and d^ are scaled first) to obey 
the sumrules. From now on we will call these type of PDF sets map PDFs. 
The map PDFs are evolved and the observables at every experimental data 
scale are computed and compared for similarity with the observables from 
the training PDFs. 

After the training we have a map with 25 map PDFs and the same 105 PDF 
sets we used to train the map. The resulting LO SOM is shown in Fig. 5, 
with just F^{x,Qiys of each cell shown for clarity. The red curves in the 
figure are the map F2's constructed from the map PDFs, black curves are 
the database F2's and green curves are the init F2's constructed from the init 
PDFs (CTEQ6, CTEQ5, MRST02, Alekhin and GRV98 parametrizations) . 
It is obviously difficult to judge visually just by looking at the curves whether 
the map is good and fully trained. One hint about possible ordering may be 
provided by the fact that the shape of the curve must correlate with the 
X^/N against the experimental data. The distribution of the 'X^ /N values 
(no correlated systematic errors are taken into account for simplicity) of the 
map PDFs, shown in each cell , does indeed seem somewhat organized. 
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Figure 5: (Colour online) Trained 1. iteration LO map. Red curves are map -Ff Sj 
green curves init -Ff black curves rest of the database -Ff '^^^ number in 
each cell is the /N of the map PDF. 

Table 1 lists the jN values the original init PDFs obtain within the MIX- 
PDF frame'work described above. Comparison of these values with the values 
in Fig. 5 reveals that some of the map PDFs, as also some of the database 
PDFs, have gained a /N comparable to or better than that of the init 
PDFs. 

Inspired by the progress, we start a second fitting iteration by selecting the 
5 best PDF sets from the 25+5+100 PDF sets of the first iteration as our 
new init PDFs (which are now properly normalized after the first iteration) to 
generate database PDFs for a whole new SOM. Since the best PDF candidate 
from the first iteration is matched on this new map as an unmodified init 
PDF, it is guaranteed that the /N as a function of the iteration either 
decreases or remains the same. We keep repeating the iterations until the 
X^ /N saturates. Fig. 6 (Case 1) shows the x^ /N as a function of iterations 
for the best PDF on the trained map, for the worst PDF on the map and for 
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PDF* 


LO xVN 


NLO x^lN 


Alekhin 


3.34 


29.1 


CTEQ6 


1.67 


2.02 


CTEQ5 


3.25 


6.48 


CTEQ4 


2.23 


2.41 


MRST02 


2.24 


1.89 


GRV98 


8.47 


9.58 



Tabic 1: xV^ fo^' different MIXPDF input PDF sets against all the datasets 
used (HI, ZEUS, BCDMS, N=709). 

the worst of the 5 PDFs selected for the subsequent iteration as an init PDF. 
The final 'X^ /N of these runs are listed in Table 2 (first row) as Case 1 and 
Fig. 7 shows these results (black solid line), together with original starting 
sets at the initial scale (note the different scaling for gluons in LO and in 
NLO figures). For 10 repeated LO runs we obtain xV-/V=l-208 and (7=0.029. 
Let us now analyze in more detail how the optimization proceeds. Figs. 8,9 
show the LO maps also for the 2. and 3. iterations. On the first iteration the 
init PDFs, the shapes of which were taken from existing parametrizations, fall 
in the cells (0,1) (CTEQ5), (1,3) (Alekhin), (1,4) (MRST02), (2,3) (CTEQ6) 
and (3,0) (GRV98), so the modern sets, Alekhin, MRST02 and CTEQ6, 
group close to each other, i.e. the shapes of the obscrvablcs they produce are 
very similar, as expected. The best 5 PDFs selected as the 2. iteration init 
PDFs also come from this neighbourhood, 3 of them from the cell (1,3) and 
the other 2 from the cell (2,3). Two of these selected sets are map PDFs, 
two are database PDFs and also the original init PDF CTEQ6 survived for 
the 2. iteration. 

At the end of the 2. iteration the init PDFs, which originated from the 
neighbouring cells, are scattered in the cells (0,1), (0.3), (1,0) (CTEQ6), 

■l-These are the 'yd jN for the initial scale PDF sets taken from the quoted parametriza- 
tions and evolved with CTEQ6 DGLAP settings, the heavy flavours were set to be zero 
below their mass thresholds, no kinematical cuts or normalization factors for the experi- 
mental data were imposed, and no correlated systematic errors of the data were used to 
compute the /N. We do not claim these values to describe the quality of the quoted 
PDF sets. 
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(2,2) and (2,3) and the best 5 PDFs produced during this iteration are in 
the cells (4,1) (database PDF), (4,0) (map PDF), (3,2) (map PDF), (2,2) 
(database PDF) and (3,0) (map PDF). 

After the 3. iteration the above best 5 PDFs of 2. iteration are in the cells 
(2,0), (3,1), (0.2), (0.4) and (3,2) and the new best 5 PDFs produced are all 
map PDFs with 4 of them in neighbouring cell pairs. 

Map PDFs provide complicated linear combinations of the database PDFs 
and obviously play an important role in the algorithm. The size of the map 
dictates how much the neighbouring map vectors differ from each other. 
Since the PDFs in the same cell are not required to be similar, only the 
observables constructed from them cell or a neighbourhood may in 

principle contain a spread of PDFs with a spread of /N^s. However, since 
our selection criteria for the init PDFs was based on the best /N only, 
it is inevitable that the observables on the map become increasingly similar 
as the iterations go by, and the /N flattens very fast as can be seen from 
Fig. 6. As a consequence we quickly lose the variety in the shapes of the 
PDFs as the iterations proceed, and on the final iteration all the PDFs on 
the map end up being very similar. 
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SOM 


-^step 


# init 


# database 


Case 


LO xVN 


TV TT "X O / 71 T 

NLO X /N 


5x5 


5 


5 


100 


1 


1.19 


1.28 


5x5 


5 


5 


100 


2 


1.37 


1.44 


5x5 


5 


10 


100 


1 


1.16 


1.25 


5x5 


5 


10 


100 


2 


1.49 


1.43 


5x5 


5 


15 


100 


1 


1.16 


1.45 


5x5 


5 


20 


100 


1 


1.17 




5x5 


10 


10 


100 


1 


1.16 


1.30 


5x5 


40 


10 


100 


1 


1.20 




15x15 


5 


5 


900 


1 


1.22 




15x15 


5 


5 


900 


2 


1.31 




15x15 


5 


30 


900 


1 


1.16 


1.25 


15x15 


5 


30 


900 


2 


1.25 


1.53 



Table 2: x^/N against all the datasets used (HI, ZEUS, BCDMS) for some 
selected MIXPDF runs. 



The MIXPDF algorithm obviously has several other weaknesses too. Among 
them are how the result would change if we started with another first iteration 
PDF selection, and what are the effects of changing the size of the map, 
number of the database PDFs and init PDF sets and size of A^stcp? In general, 
how much the final result depends on the choices that we make during the 
fitting process? 

Let us now study some of these questions a bit. Since we have chosen our 
evolution settings to be those of CTEQ6's, it necessarily becomes a favoured 
set (although we don't impose any kinematical cuts on the experimental 
data). Therefore we performed another LO and NLO runs, with CTEQ6 
now replaced with CTEQ4 [44]. The results of these runs are reported in 
Table 2 (2. row) and in Fig. 6 (xViV) and Fig. 7 (the PDFs) as Case 2 . 
The Case 2 clearly produces worse results. Without an input from CTEQ6 
we automatically lose all the low gluons at small-a; -type of results in NLO, 
for example. 

Fig. 10 addresses the issue of choosing different A^step, for LO Case 1 and 
Case 2. The solid (dotted) fines show the best (worst) init PDF selected 
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Figure 7: (Colour online) MIXPDF results together with the input PDF sets. 



from each iteration for several A'^step selections. In Case 2 the best results 
are obtained with small number of training steps, whereas Case 1 does not 
seem to benefit from a longer SOM training. Keeping the stochastical nature 
of the process in our minds, we may speculate that the seemingly opposite 
behaviour for the Case 1 and Case 2 results from the fact that it is more 
probable to produce a good set of database PDFs in Case 1 than in Case 2. 
If the database is not so good to begin with, the longer training contaminates 
all the map PDFs with the low quality part of the database PDFs. 
Table 2 also showcases best results from a variety of MIXPDF runs where 
we have tried different combinations of SOM features. It is interesting to 
notice that increasing the size of the map and database does not necessarily 
lead to a better performance. Instead the number of the init PDFs on later 
iterations (always 5 for the first iteration) seem to be a key factor, it has 
to be sufficiently large to preserve the variety of database the PDFs but 
small enough to consist of PDFs with good quality. In our limited space of 
database candidates (~ 6^ = 7776 possibilities) the optimal set of variables 
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for the MIXPDF run should not be not impossible to map down. 
The method we used to generate the sample data is very simple indeed. The 
number of all the possible candidate database PDFs is not very large to begin 
with, so the quality of the final results strongly depends on the quality of 
the input for the SOM. Since the map PDFs are obtained by averaging with 
the training samples, and the non-valence flavours are scaled by a common 
factor when imposing the sumrules, the map PDFs tend to lie in between the 
init PDFs. Therefore map PDFs with extreme shapes are never produced, 
and thus never explored by the algorithm. 

A method which relies on sampling existing parametrizations on a SOM is 
inconvenient also because it is not readily applicable to multivariable cases. 
For the case of the PDFs it is sufficient to have a value for each flavour 
for a discrete set of a;-values, but for a multivariable cases, such as nPDFs, 
or GPDs the task of keeping track of the grid of values for each flavour in 
each X for several different values of the additional variables {e.g. A and Z 
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Figure 9: (Colour online) Trained 3. iteration LO map, curves and numbers as in 
Fig.5. 

for nuclei, and the skewness, ^ and the squared 4-momentum transfer, t for 
GPDs) is computationally expensive. In principle, SOM can keep track of 
the interrelations of the map vectors, and knowing the parametrization for 
the init PDFs, it would be possible to construct the parametrization for the 
map PDFs. That would, however, lead to very complicated parametrizations, 
and different nPDF, GPD etc. parametrizations are presently not even either 
measured or defined at the same initial scale. 

Despite of its problems, on a more basic level, MIXPDF does have the the 
desirable feature that it allows us to use SOM as a part of the PDF opti- 
mization algorithm in such a way that we cluster our candidate PDFs on 
the map, and select those PDFs which i) minimize a chosen fitness function, 
e.g. x^y when compared against experimental data, and ii) have some desired 
feature which can be visualized on the map or used as a clustering criterion. 
Therefore, in the following Section, we keep building on this example. 
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Figure 10: (Colour online) LO /N for 5 x 5 SOM runs with different A^'step- 



4 ENVPDF algorithm 

Most of the problems with the MIXPDF algorithm originate from the need 
to be able to generate the database PDFs in an unbiased way as possible, and 
at the same time to have a variety of PDF candidates available at every stage 
of the fitting procedure. Yet, one needs to have control over the features of 
the database PDFs that are created. 

To accomplish this, we choose, at variance with the "conventional" PDFs 
sets or NNPDFs, to give up the functional form of PDFs and rather to rely 
on purely stochastical methods in generating the initial and training samples 
of the PDFs. Our choice is a GA-type analysis, in which our parameters are 
the values of PDFs at the initial scale for each flavour at each value of x 
where the experimental data exist. To obtain control over the shape of the 
PDFs we use some of the existing distributions to establish an initial range, 
or envelope, within which we sample the database PDF values. 
Again, we use the Case 1 and 2 PDF sets (CTEQ6, CTEQ5, CTEQ4, 
MRST02, Alekhin and GRV98) as an initialization guideline. We construct 
our initial PDF generator first to, for each flavour separately, select ran- 
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domly either the range [0.5, 1], [1.0, 1.5] or [0.75, 1.25] times any of the Case 
1 (or 2) PDF set. Compared to MIXPDF algorithm we are thus adding more 
freedom to the scahng of the database PDFs. Next the initial generators gen- 
erate values for each x^atJ using uniform, instead of Gaussian, distribution 
around the existing parametrizations, thus reducing direct bias from them. 
Gaussian smoothing is applied to the resulting set of points, and the flavours 
combined to form a PDF set such that the curve is hnearly interpolated from 
the discrete set of generated points. 

The candidate PDF sets are then scaled to obey the sumrules as in MIXPDF 
algorithm. In order to obtain a reasonable selection of PDFs to start with, 
we reject candidates which have /N > 10 (computed as in MIXPDF algo- 
rithm). To further avoid direct bias from the Case 1 and 2 PDFs, we don't 
include the init PDFs into the training set for the first iteration as we did in 
MIXPDF case. For a. N x N SOM we choose the size of the database to be 

During the later iterations we proceed as follows: At the end of each iteration 
we pick from the trained x SOM 2A^ best PDFs as the init PDFs. These 
init PDFs are introduced into the training set alongside with the database 
PDFs, which are now constructed using each of the init PDFs in turn as a 
center for a Gaussian random number generator, which assigns for all the 
fiavours for each x a value around that same init PDF such that 1 — cr of 
the generator is given by the spread of the best PDFs in the topologically 
nearest neighbouring cells. The object of these generators is thus to refine a 
good candidate PDF found in the previous iteration by jittering it's values 
within a range determined by the shape of other good candidate PDFs from 
the previous iteration. The generated PDFs are then smoothed and scaled 
to obey the sumrules. Sets with x^/A^ > 10 are always rejected. We learnt 
from the MIXPDF algorithm that it is important to preserve the variety of 
the PDF shapes on the map, so we also keep Aorig copies of the first iteration 
generators in our generator mix. 

Table 3 lists results from a variety of such runs. The results do not seem 

§To ensure a reasonable large- a; behaviour for the PDFs, we also generate with the same 
method values for them in a few x-points outside the range of the experimental data. We 
also require the PDFs, the gluons especially, to be positive for simplicity. 
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to be very sensitive to the number of SOM training steps, A^step; but are 
highly sensitive to the number of first iteration generators used in subse- 
quent iterations. Although the generators can now in principle produce an 
infinite number of different PDFs, the algorithm would not be able to radi- 
cally change the shape of the database PDFs without introducing a random 
element on the map. Setting iVong > provides, through map PDFs, that 
element, and keeps the algorithm from getting fixed to a local minimum. The 
ENVPDF algorithm is now more independent from the initial selection of the 
PDF sets. Case 1 or 2, than MIXPDF, since no values of e.g. the CTEQ6 
set in the original generator are ever introduced on the map directly. 
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Table 3: xV^ against all the datasets used (HI, ZEUS, BCDMS) for variety 
of ENVPDF runs. 

Fig. 11 shows the ')^/N as a function of iteration for 5x5 LO and NLO, 
both Case 1 and Case 2, runs, where A^step — 5 and A^orig = 2. Clearly 
the ENVPDF runs take multiple number of iterations for the x^ /N to level 
compared to the MIXPDF runs, and they are therefore more costly in time. 
With the ENVPDF algorithm, however, the keeps on slowly improving 

even after all the mother PDFs from the same iteration are equally good fits. 
For a larger 15x15 SOM the number of needed iterations remains as large. 
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Figure 11: (Colour online) /N oi the ENVPDF runs as a function of the itera- 
tion. 

Fig. 12 shows some of the Case 1 LO ENVPDF results at the initial scale 
(5 = 1-3 GeV (left panel), and evolved up to Q = 3 GeV (right panel). 
The reference curves shown are also evolved as in MIXPDF. Although the 
initial scale ENVPDF results appear wiggly, they smooth out soon because 
of the additional well known effect of QCD evolution. In fact, the initial scale 
curves could be made smoother by applying a stronger Gaussian smoothing, 
but this is not necessary, as long as the starting scale is below the Q™™ of 
the data. The evolved curves preserve the initially set baryon number scaling 
within 0.5% and momentum sumrule within 1.5% accuracy. Also, the results 
obtained from a larger map tend to be smoother since the map PDFs get 
averaged with a larger number of other PDFs. Studying the relation between 
the redundant wiggliness of our initial scale PDFs and possible fitting of 
statistical fluctuations of the experimental data is beyond the scope of this 
paper. The NLO Case 1 and 2 results are presented in Fig. 13. The trend 
of the results is clearly the same as in MIXPDF case, CTEQ6 is a favoured 
set, and especially the PDFs with gluons similar to those of CTEQ6's have 
good ■ 
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We did not study the effect of modifying the width or the shape of the 
envelope in detail here, but choosing the envelope to be the wider or narrower 
than 1 — 0" for the Gaussian generate seem to lead both slower and poorer 
convergence. Also, since we are clustering on the similarity of the observables, 
the same cell may in theory contain the best PDF of the iteration and PDFs 
which have /N as large as 10. Therefore the shape of the envelope should 
be determined only by the curves with promising shapes. 




Figure 12: (Colour online) LO ENVPDF results at the initial scale, and at Q = 3.0 
GeV. 



Next we want to study the PDF uncertainty using the unique means the 
SOMs provide to us even for of PDFs without a functional form. 

Since we have only used DIS data in this introductory study, we are only 
able to explore the small-x uncertainty for now. Figs. 14 (LO) and 15 (NLO) 
showcase, besides our best results, the spread of all the initial scale PDFs 
with /N < 1.2, that were obtained during a 5 x 5 (left panel) and 15 x 15 
(right panel) SOM run. Since the number of such PDF sets is typically of the 
order of thousands, we only plot the minimum and maximum of the bundle 
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Figure 13: (Colour online) NLO ENVPDF results at the initial scale, and at 
Q = 3.0 GeV. 



of curves. Since the total number of experimental datapoints used is ~ 710, 
the spread Ax^/N ~ 0.2 corresponds to a Ax^ ~ 140. Expectedly, the small- 
X gluons obtain the largest uncertainty for all the cases we studied. Even 
though a larger SOM with a larger database might be expected to have more 
variety in the shapes of the PDFs, the /N < 1.2 spreads of the 5x5 and 
15 X 15 SOMs are more or less equal sized (the apparent differences in sizes 
ai Q = Qq even out when the curves are evolved). Both maps therefore end 
up producing the same extreme shapes for the map PDFs although a larger 
map has more subclasses for them. Remarkably then, a single SOM run can 
provide a quick uncertainty estimate for a chosen Ax^ without performing a 
separate error analysis. 

Due to the stochastical nature of the ENVPDF algorithm, we may well also 
study the combined results from several separate runs. It is especially impor- 
tant to verify the stability of our results, to show that the results are indeed 
reproducible instead of lucky coincidences. Left panels of Figs 16 (LO) and 
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Figure 14: (Colour online) LO ENVPDF best result and the x^/N < 1.2 spread 
of results. 



17 (NLO) present the best results, and the combined /N < 1.2 spreads 
for 10 repeated 5x5, N^tep = 5 runs at the initial scale. The average x^/-^ 
and the standard deviation a for these runs are in LO (NLO) are 1.065 and 
0.014 (1.122 and 0.029), corresponding to Ax^ ~ 10 (20) for LO (NLO). 
The right panels of the same Figs 16, 17 show the 10 best result curves and 
the x^/^ < 1-2 spreads evolved up to Q = 3.0 GeV. Clearly the seemingly 
large difference between the small-x gluon results at the initial scale is not 
statistically significant, but smooths out when gluons are evolved. Thus the 
initial scale wiggliness of the PDFs is mainly only a residual effect from our 
method of generating them and not linked to the overtraining of the SOM, 
and we refrain from studying cases where stronger initial scale smoothing 
is applied. Therefore our simple method of producing the candidate PDFs 
by jittering random numbers inside a predetermined envelope is surprisingly 
stable when used together with a complicated PDF processing that SOMs 
provide. 
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Figure 15: (Colour online) NLO ENVPDF best result and the ^x^jN < 1.2 spread 
of results. 



5 Future of the SOMPDFs 

So far we have shown a relatively straightforward method of obtaining stochas- 
tically generated, parameter- free, PDFs, with an uncertainty estimate for a 
desired Ax^. On every iteration using our competitive learning algorithm, 
the selection of the winning PDFs was based on the /N alone, and the 
fitting procedure was fully automated. In our MIXPDF algorithm the SOMs 
were used merely as a tool to create new combinations, map PDFs, of our in- 
put database. The ENVPDF algorithm also used the topology of the map to 
determine the shape of the envelope, within which we sampled the database 
PDFs. 

We reiterate that our initial study was aimed at observing and recording the 
behavior of the SOM as an optimization tool. Many of the features of our 
results could not in fact be predicted based on general assumptions. The 
proposed method can be extended much further than that. The automated 
version of the algorithm could be set to sample a vector consisting of PDF 
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Figure 16: (Colour online) LO ENVPDF best results and the \^ /N < 1.2 spreads 
of results from 10 separate runs. 



parameters, instead of values of PDFs in each value of x of the data. That 
would lead to smooth, continuous type of solutions, either along the lines 
of global analyses, or NNPDFs using N SOMs for Monte-Carlo sampled 
replicas of the data. Since the solution would be required to stay within an 
envelope of selected width and shape given by the map, no restrictions for 
the parameters themselves would be required. For such a method, all the 
existing error estimates, besides an uncertainty band produced by the map, 
would be applicable as well. 

What ultimately sets the SOM method apart from the standard global anal- 
yses or NNPDF method, however, are the clustering and visualization possi- 
bilities that it offers. Instead of setting Mdata = Li and clustering according 
to the similarity of the observables, it is possible to set the clustering crite- 
ria to be anything that can be mathematically quantified, e.g. the shape of 
the gluons or the large-x behaviour of the PDFs. The desired feature of the 
PDFs can then be projected out from the SOM. Moreover, by combining the 
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X X 
Figure 17: (Colour online) NLO ENVPDF best results and the xV^ < 1-2 
spreads of results from 10 separate runs. 



method with an interactive graphic user interface (GUI), it would be pos- 
sible to change and control the shape and the width of the envelope as the 
minimization proceeds, to guide the process by applying researcher insight at 
various stages of the process. Furthermore, the uncertainty band produced 
by the SOM as the run proceeds, could help the user to make decisions about 
the next steps of the minimization. With GUI it would be e.g. possible to 
constrain the extrapolation of the NN generated PDFs outside the x-range 
of the data without explicitly introducing terms to ensure the correct small- 
and large- a; behaviour as in NNPDF method (see Eq.(87) in [23]). The se- 
lection of the best PDF candidates for the subsequent iteration could then 
be made based on the user's preferences instead of solely based on the 1^ ■ 
That kind of method in turn could be extended to multivariable cases such 
as nPDFs and even GPDs and other not so well-known cases, where the data 
is too sparse for stochastically generated, parameter-free, PDFs. 
Generally, any PDF fitting method involves a large number of flexible points 
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"opportunities for adapting and fine tuning", which act as a source for 
both systematical and theoretical bias when fixed. Obvious optimization 
method independent sources of theoretical bias are the various parameters 
of the DGLAP equations, inclusion of extra sources of Q^-dependence be- 
yond DGLAP-type evolution and the data selection, affecting the coverage 
of different kinematical regions. SOMs themselves, and different SOMPDF 
algorithm variations naturally also introduce flexible points of their own. We 
explored a httle about the effects of choosing the size of the SOM and the 
number of the batch training steps A'step. There are also plenty of other 
SOM properties that can be modified, such as the shape of the SOM itself. 
We chose to use a rectangular lattice, but generally the SOM can take any 
shape desired. For demanding vizualisation purposes a hexagonal shape is 
an excellent choice, since the meaning of the nearest neighbours is better 
defined. 

The SOMPDF method, supplemented with the use of a GUI, will allow us 
to both qualitatively and quantitatively study the flexible points involved 
in the PDFs fitting. More complex hadronic matrix elements, such as the 
ones defining the GPDs, are natural candidates for future studies of cases 
where the experimental data are not numerous enough to allow for a model 
independent fitting, and the guidance and intuition of the user is therefore 
irreplaceable. The method we are proposing is extremely open for user in- 
teraction, and the possibilities of such a method are widely unexplored. 
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